library(MASS)
library(GGally)
library(ggplot2)
library(tidyverse)
library(corrplot)
data(Boston)
dim(Boston)
## [1] 506 14
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The variables are on quite different scales.
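A quick numeric check of the column ranges (a minimal base-R sketch, not part of the original chunks) confirms this:

```r
library(MASS)
data(Boston)
# width of each variable's range: tax spans hundreds of units, nox less than one
sapply(Boston, function(x) diff(range(x)))
```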
# plot distributions and correlations of continuous variables
p <- Boston %>%
  ggpairs(
    mapping = aes(alpha = 0.5),
    lower = list(continuous = wrap("points", colour = "cornflowerblue", size = 0.3)),
    upper = list(continuous = wrap("cor", size = 2)),
    diag = list(continuous = wrap("densityDiag", colour = "cornflowerblue"))) +
  theme(axis.text.x = element_text(size = 5),
        axis.text.y = element_text(size = 5))
p
Only a few variables are even close to normally distributed; the clearest case is rm (the average number of rooms per dwelling). Most of the variables are skewed and/or bimodal. Let's also check the correlations.
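The skewness impression can also be quantified without extra packages; a sketch using Pearson's second skewness coefficient (my choice of measure, not part of the original analysis):

```r
library(MASS)
data(Boston)
# Pearson's second skewness coefficient: 3 * (mean - median) / sd
skew <- sapply(Boston, function(x) 3 * (mean(x) - median(x)) / sd(x))
round(sort(skew), 2)  # rm sits near zero, crim far from it
```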
cor_matrix <- round(cor(Boston), 2)
corrplot(cor_matrix, tl.cex = 0.7)
There are some clear negative correlations, especially involving dis (distance to the Boston employment centres); for example, dis correlates negatively with age. Tax has strong positive correlations with several variables, e.g. rad (accessibility to radial highways). Otherwise it is difficult to say much at this stage, since no clear patterns stand out.
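The strongest pairs can be pulled out numerically rather than by eye; a sketch (the 0.7 cutoff is an arbitrary choice):

```r
library(MASS)
data(Boston)
cm <- cor(Boston)
cm[lower.tri(cm, diag = TRUE)] <- NA            # keep each pair only once
strong <- which(abs(cm) > 0.7, arr.ind = TRUE)  # 0.7 is an arbitrary cutoff
data.frame(var1 = rownames(cm)[strong[, 1]],
           var2 = colnames(cm)[strong[, 2]],
           r    = round(cm[strong], 2))
```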
Since the variables are on such different scales (which is bad for distance-based methods like clustering), we need to scale the data:
boston_sc <- scale(Boston)
boston_sc <- as.data.frame(boston_sc) # scale() returns a matrix, so convert it back to a data frame
summary(boston_sc)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
sd(boston_sc$dis)
## [1] 1
Scaling centred every variable at mean 0 and gave each unit variance, which makes the variables much more comparable: for example, tax originally ranged from 187.0 to 711.0, and after scaling it ranges from -1.3127 to 1.7964. This step makes sense when we are interested in the distances between the observations in this dataset.
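As a sanity check, scale() with its default arguments is just (x - mean(x)) / sd(x) column by column; a sketch verifying this on the tax column:

```r
library(MASS)
data(Boston)
boston_sc <- as.data.frame(scale(Boston))
# replicate scale() by hand for one column
manual <- (Boston$tax - mean(Boston$tax)) / sd(Boston$tax)
all.equal(manual, boston_sc$tax)  # TRUE
```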
Next we create a categorical variable called crime, based on the quantiles of the crim variable:
brks <- quantile(boston_sc$crim)
lbls <- c("low", "med_low", "med_high", "high")
crime <- cut(boston_sc$crim, breaks = brks, labels = lbls, include.lowest = TRUE)
boston_sc$crime <- crime
boston_sc <- dplyr::select(boston_sc, -crim) # remove the old crim variable
summary(boston_sc$crime)
## low med_low med_high high
## 127 126 126 127
Then let’s divide the data into a training set (80%) and a test set (20%):
n <- nrow(boston_sc)
ind <- sample(n, size=n*0.8)
train_set <- boston_sc[ind,]
test_set <- boston_sc[-ind,]
Now we have two sets: train_set contains a randomly chosen 80% of the Boston observations, and test_set contains the remaining 20%.
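Note that no seed was set, so the split (and every accuracy figure that depends on it) changes between runs; a reproducible sketch (the seed value 2024 is an arbitrary choice of mine):

```r
library(MASS)
data(Boston)
boston_sc <- as.data.frame(scale(Boston))
n <- nrow(boston_sc)
set.seed(2024)                           # arbitrary seed, fixed for reproducibility
ind <- sample(n, size = floor(n * 0.8))  # floor() avoids a non-integer size
train_set <- boston_sc[ind, ]
test_set  <- boston_sc[-ind, ]
c(train = nrow(train_set), test = nrow(test_set))
```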
Next I estimate a linear discriminant model with crime as the target variable. The purpose of the analysis is to identify those variables that explain whether a tract has a high or low crime rate. We have here a four-class grouping of crime rate.
A linear discriminant model assumes that the explanatory variables are continuous and normally distributed within each class defined by the target variable. Moreover, the classes are assumed to share a common covariance matrix. According to the preliminary analysis, the normality assumption is not satisfied; I do not check it within each crime class but simply assume it. Scaling puts all variables on a comparable unit-variance scale, although it does not by itself guarantee equal within-class covariances.
The results from linear discriminant analysis are shown below. The prior probabilities show the proportions of observations belonging to the four groups in the train data. They are not exactly equal because the grouping was done with all the 506 tracts. The variable means differ across crime groups suggesting that they have an association with the crime rate.
The first linear discriminant explains 95% of the variance between the groups based on crime rate.
lda_fit <- lda(crime ~ ., data = train_set)
lda_fit
## Call:
## lda(crime ~ ., data = train_set)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2698020 0.2425743 0.2178218 0.2698020
##
## Group means:
## zn indus chas nox rm age
## low 0.8873851 -0.9055876 -0.163968511 -0.8563255 0.41868172 -0.8558901
## med_low -0.1733183 -0.2330967 0.008892378 -0.5657986 -0.15038773 -0.3508825
## med_high -0.4170776 0.2713442 0.309288013 0.4379456 0.07981804 0.4569088
## high -0.4872402 1.0169738 -0.055607954 1.0231636 -0.47859267 0.8072208
## dis rad tax ptratio black lstat
## low 0.8526534 -0.6763157 -0.7231059 -0.397718176 0.37637602 -0.74695791
## med_low 0.3289085 -0.5564696 -0.4712364 -0.006797986 0.32107277 -0.13552006
## med_high -0.3660002 -0.3893666 -0.2626795 -0.215662347 0.06925054 0.04360256
## high -0.8749854 1.6395837 1.5150965 0.782471277 -0.82715139 0.94726249
## medv
## low 0.470654841
## med_low -0.001237102
## med_high 0.111587797
## high -0.694051625
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.03198102 0.7549582724 -0.87264330
## indus 0.05562842 -0.4135340867 0.38521445
## chas -0.06437677 -0.1265580100 0.07399599
## nox 0.34801587 -0.6134364549 -1.44896506
## rm -0.08329676 -0.1342096938 -0.22386991
## age 0.22099856 -0.3627396371 -0.26208070
## dis -0.03361647 -0.3966916356 0.02597317
## rad 3.15590347 0.9503847172 -0.03347636
## tax 0.04789528 -0.0362850360 0.49294637
## ptratio 0.08624829 -0.0006524072 -0.20503229
## black -0.11623554 0.0093251572 0.07659004
## lstat 0.27658104 -0.2687005785 0.35890290
## medv 0.21390952 -0.4036817063 -0.08914383
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9496 0.0372 0.0132
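The proportion-of-trace line can be reproduced from the singular values that lda() stores in the fit. A self-contained sketch, refitted on all 506 scaled observations rather than the random training split (so the numbers differ slightly from the output above):

```r
library(MASS)
data(Boston)
b <- as.data.frame(scale(Boston))
b$crime <- cut(b$crim, breaks = quantile(b$crim),
               labels = c("low", "med_low", "med_high", "high"),
               include.lowest = TRUE)
b$crim <- NULL
fit <- lda(crime ~ ., data = b)
# proportion of trace = squared singular values, normalised to sum to 1
round(fit$svd^2 / sum(fit$svd^2), 4)
```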
The LDA biplot based on the estimated model is shown below. The observations are colored on the basis of their crime group. The arrows indicate that variable rad (index of accessibility to radial highways) is a strong predictor of linear discriminant 1, while variables nox (air pollution) and zn (prop. of residential land zoned for large lots) explain linear discriminant 2.
# the function for lda biplot arrows
lda_arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train_set$crime)
colnames(train_set)
## [1] "zn" "indus" "chas" "nox" "rm" "age" "dis"
## [8] "rad" "tax" "ptratio" "black" "lstat" "medv" "crime"
# plot the lda results
plot(lda_fit, dimen = 2, col=classes, pch=classes)
lda_arrows(lda_fit, myscale = 2)
I use the test data to validate the model, i.e. to see whether the observations in the test data are correctly classified.
# save the correct classes from test data
correct_classes <- test_set$crime
# remove the crime variable from test data
test <- dplyr::select(test_set, -crime)
The confusion matrix below (shown with margins) indicates that the model classified 78 of the 102 test observations correctly, an accuracy of about 76%; in other words, roughly a quarter of the observations fall off the diagonal. The hardest class was med_low, where only 15/29 cases were classified correctly. Overall I would consider this a reasonably good success rate for a four-class problem. (Since the train/test split was not seeded, the exact numbers vary between runs.)
# predict classes with test data
lda_pred <- predict(lda_fit, newdata = test)
# confusion matrix (margins added for readability only)
tab1 <- table(correct = correct_classes, predicted = lda_pred$class)
addmargins(tab1)
# accuracy must be computed from the margin-free table; applying addmargins()
# first would add the Sum row/column into both the diagonal and the total
accuracy1 <- sum(diag(tab1)) / sum(tab1)
accuracy1
## [1] 0.7647059
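Per-class recall (the diagonal over the row sums) shows directly which crime classes are hardest. A self-contained sketch with an arbitrary seed of 1, so its exact numbers will not match an unseeded run:

```r
library(MASS)
data(Boston)
b <- as.data.frame(scale(Boston))
b$crime <- cut(b$crim, breaks = quantile(b$crim),
               labels = c("low", "med_low", "med_high", "high"),
               include.lowest = TRUE)
b$crim <- NULL
set.seed(1)                                   # arbitrary seed
ind <- sample(nrow(b), floor(0.8 * nrow(b)))
fit <- lda(crime ~ ., data = b[ind, ])
pred <- predict(fit, newdata = b[-ind, names(b) != "crime"])
tab <- table(correct = b$crime[-ind], predicted = pred$class)
round(diag(tab) / rowSums(tab), 2)            # per-class recall
```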
As a final step, I run a K-means clustering analysis on the Boston data set. K-means divides the observations into a pre-defined number of clusters (K) by minimizing the distance of each observation to its cluster mean (centroid). I first look at the distances between observations using a popular distance measure, the Euclidean distance.
library(MASS)
data('Boston')
# the binary variable chas could optionally be dropped here:
# Boston <- Boston %>% dplyr::select(-chas)
# center and standardize variables
boston_scaled <- scale(Boston)
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
# euclidean distance matrix
dist_eu <- dist(boston_scaled)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
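Euclidean distance is only the default; dist() supports several metrics, and Manhattan distance is a common alternative. A comparison sketch (the Manhattan distance between two points is always at least the Euclidean distance):

```r
library(MASS)
data(Boston)
bs <- as.data.frame(scale(Boston))
dist_eu  <- dist(bs)                        # Euclidean, the default
dist_man <- dist(bs, method = "manhattan")  # L1 alternative
c(n_pairs    = length(dist_eu),
  eu_median  = median(dist_eu),
  man_median = median(dist_man))
```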
I first try two clusters (K=2). The plots below suggest that the variables black and tax have a strong association with the clusters, and many of the other variables seem to have a somewhat weaker effect, too.
# k-means clustering (on the scaled data)
km <- kmeans(boston_scaled, centers = 2)
summary(km)
## Length Class Mode
## cluster 506 -none- numeric
## centers 28 -none- numeric
## totss 1 -none- numeric
## withinss 2 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 2 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
# plot the scaled Boston dataset with clusters, a few variables at a time
pairs(boston_scaled[1:5], col = km$cluster)
pairs(boston_scaled[6:10], col = km$cluster)
pairs(boston_scaled[11:14], col = km$cluster)
I search for the optimal number of clusters K by inspecting how the total within-cluster sum of squares (total WCSS) changes as K grows, letting K run from 1 to 10. A good K is indicated by the point (the "elbow") after which the total WCSS no longer drops rapidly.
The plot below shows that the optimal number of clusters is 2.
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
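kmeans() can land in a local optimum, which can make the elbow curve jagged; adding random restarts via nstart stabilises it. A sketch (nstart = 10 is my choice, not part of the original run):

```r
library(MASS)
data(Boston)
bs <- as.data.frame(scale(Boston))
set.seed(123)
k_max <- 10
# keep the best of 10 random starts for each K
twcss <- sapply(1:k_max, function(k) kmeans(bs, centers = k, nstart = 10)$tot.withinss)
# share of the K = 1 total variation still unexplained at each K
round(twcss / twcss[1], 2)
```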
I choose K=2 and re-run the K-means clustering algorithm. The plots below support dividing this data set into two clusters.
# k-means clustering
km <-kmeans(boston_scaled, centers = 2)
# plot the Boston dataset with 2 clusters
pairs(boston_scaled[1:6], col = km$cluster)
pairs(boston_scaled[7:14], col = km$cluster)
data(Boston)
set.seed(22)
boston_again <- scale(Boston)
boston_again <- as.data.frame(boston_again)
boston_km <- kmeans(boston_again,centers=4)
boston_again$cluster <- boston_km$cluster
boston_lda <- lda(cluster~., data=boston_again)
plot(boston_lda, dimen=2, col=boston_again$cluster)
lda_arrows(boston_lda, myscale=2)
It seems that the most influential individual variable is black. Other variables with a relatively strong influence are crim, indus, nox and tax. The variables black and crim seem to pull in their own directions, while most of the other variables form a group pulling to the left. The fact that black appears influential confirms earlier observations (e.g. the 7th part of this exercise).
model_predictors <- dplyr::select(train_set, -crime)
# check the predictor summaries
summary(model_predictors)
## zn indus chas nox
## Min. :-0.48724 Min. :-1.55630 Min. :-0.27233 Min. :-1.464433
## 1st Qu.:-0.48724 1st Qu.:-0.86683 1st Qu.:-0.27233 1st Qu.:-0.920756
## Median :-0.48724 Median :-0.18028 Median :-0.27233 Median :-0.169964
## Mean :-0.02493 Mean : 0.03261 Mean : 0.01029 Mean : 0.003159
## 3rd Qu.:-0.48724 3rd Qu.: 1.01499 3rd Qu.:-0.27233 3rd Qu.: 0.613189
## Max. : 3.80047 Max. : 2.42017 Max. : 3.66477 Max. : 2.729645
## rm age dis rad
## Min. :-3.87641 Min. :-2.333128 Min. :-1.265817 Min. :-0.98187
## 1st Qu.:-0.58052 1st Qu.:-0.813528 1st Qu.:-0.843952 1st Qu.:-0.63733
## Median :-0.15604 Median : 0.297529 Median :-0.261737 Median :-0.52248
## Mean :-0.03526 Mean : 0.001279 Mean :-0.005963 Mean : 0.04009
## 3rd Qu.: 0.47482 3rd Qu.: 0.900573 3rd Qu.: 0.674147 3rd Qu.: 1.65960
## Max. : 3.55153 Max. : 1.116390 Max. : 3.284050 Max. : 1.65960
## tax ptratio black lstat
## Min. :-1.30676 Min. :-2.70470 Min. :-3.90333 Min. :-1.50301
## 1st Qu.:-0.76237 1st Qu.:-0.48756 1st Qu.: 0.19756 1st Qu.:-0.76397
## Median :-0.39895 Median : 0.29768 Median : 0.38371 Median :-0.18527
## Mean : 0.04215 Mean : 0.05518 Mean :-0.02865 Mean : 0.03067
## 3rd Qu.: 1.52941 3rd Qu.: 0.80578 3rd Qu.: 0.44062 3rd Qu.: 0.62343
## Max. : 1.79642 Max. : 1.63721 Max. : 0.44062 Max. : 3.54526
## medv
## Min. :-1.90634
## 1st Qu.:-0.63420
## Median :-0.17210
## Mean :-0.03627
## 3rd Qu.: 0.24651
## Max. : 2.98650
summary(lda_fit$scaling)
## LD1 LD2 LD3
## Min. :-0.11624 Min. :-0.6134365 Min. :-1.44897
## 1st Qu.:-0.03362 1st Qu.:-0.3966916 1st Qu.:-0.22387
## Median : 0.05563 Median :-0.1342097 Median :-0.03348
## Mean : 0.31843 Mean :-0.0801401 Mean :-0.13243
## 3rd Qu.: 0.22100 3rd Qu.:-0.0006524 3rd Qu.: 0.07659
## Max. : 3.15590 Max. : 0.9503847 Max. : 0.49295
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda_fit$scaling
matrix_product <- as.data.frame(matrix_product)
summary(matrix_product)
## LD1 LD2 LD3
## Min. :-4.4002 Min. :-3.37037 Min. :-3.37316
## 1st Qu.:-2.6131 1st Qu.:-0.62138 1st Qu.:-0.48630
## Median :-1.6335 Median : 0.06645 Median : 0.12156
## Mean : 0.1422 Mean : 0.01376 Mean : 0.05807
## 3rd Qu.: 5.6791 3rd Qu.: 0.58766 3rd Qu.: 0.62630
## Max. : 7.2669 Max. : 3.34902 Max. : 2.47373
# install.packages("plotly")
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x=matrix_product$LD1, y=matrix_product$LD2, z=matrix_product$LD3, type="scatter3d", mode="markers", color=classes, size=I(40))
train_km <- kmeans(train_set[,-14],centers=4)
cluster_col <- train_km$cluster
plot_ly(x=matrix_product$LD1, y=matrix_product$LD2, z=matrix_product$LD3, type="scatter3d", mode="markers", color=cluster_col, size=I(40))
The way the points are scattered in these two plots resembles each other fairly well. Based on them, I would say that the crime variable plays a relatively large role when the algorithm defines the clusters, since the cluster colours are scattered quite similarly to the crime-rate colours.
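The visual impression can also be checked with a cross-tabulation of cluster against crime class. A self-contained sketch on the full scaled data, with crim excluded from the clustering so the comparison is not circular (seed 22 as above):

```r
library(MASS)
data(Boston)
b <- as.data.frame(scale(Boston))
crime <- cut(b$crim, breaks = quantile(b$crim),
             labels = c("low", "med_low", "med_high", "high"),
             include.lowest = TRUE)
set.seed(22)
km <- kmeans(b[, names(b) != "crim"], centers = 4)
# if the clusters track the crime rate, the counts concentrate in a few cells
table(crime = crime, cluster = km$cluster)
```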